Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match viz dataframe column case to form_data fields for Snowflake, Oracle and Redshift #5487

Merged
merged 17 commits into from
Aug 3, 2018
Merged

Conversation

villebro
Copy link
Member

@villebro villebro commented Jul 25, 2018

This might be slightly hacky, but I feel SQL Alchemy gives so much latitude to engines that Superset might need to be slightly more forgiving if a query result has different case compared to the datasource/form metadata. A brief summary of what this PR does:

Adjust dataframe column case, by performing case-sensitive comparison of all colums in central fields in form_data (metrics, groupby) with column names in dataframe:

  • Leave all matches untouched.
  • Replace all dataframe columns with whichever case-insensitive variant is present in form_data.

Examples:

  • form_data: __timestamp, dataframe: __timestamp -> Do nothing.
  • form_data: __timestamp, dataframe: __TIMESTAMP -> Rename dataframe column name to __timestamp.
  • form_data: __timeSTAMP, dataframe: __TIMESTAMP -> Rename dataframe column name to __timeSTAMP.

The dataframe is adjusted prior to caching in BaseViz.get_df_payload(), with the logic for adjustments located in db_engine_specs, which is controlled by the consistent_case_sensitivity attribute. To minimize risk of collisions, dedup has been changed to be able to perform deduping both in a case-sensitive and case-insensitive manner. Default handling would now be case-sensitive, as before, but for affected engines (Snowflake, Oracle, Redshift) handling will be case-insensitive. Example from test case (note the last Bar which is seen as a duplicate despite different case):

>>> print(','.join(dedup(['foo', 'bar', 'bar', 'bar', 'Bar'], case_sensitive=False)))
foo,bar,bar__1,bar__2,Bar__3

Other changes:

  • Add schema URI insertion for Snowflake db_engine_spec
  • Fix what seemed to be a small bug in how handle_nulls() was implemented
  • Remove a redundant full_table_name row in db_engine_spec

Feel free to try it out, and don't feel shy to give feedback/critique. Like stated above, this might be borderline hacky, but seems to work fairly well, and might make it easier to add new engines later on. This fix should also be backwards compatible (except for the handle_nulls()-part, which was recently added), at least 0.26 and 0.25.

@codecov-io
Copy link

codecov-io commented Jul 25, 2018

Codecov Report

Merging #5487 into master will decrease coverage by 0.03%.
The diff coverage is 51.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5487      +/-   ##
==========================================
- Coverage   63.12%   63.08%   -0.04%     
==========================================
  Files         349      349              
  Lines       22167    22203      +36     
  Branches     2462     2462              
==========================================
+ Hits        13992    14006      +14     
- Misses       8161     8183      +22     
  Partials       14       14
Impacted Files Coverage Δ
superset/dataframe.py 94.59% <100%> (+0.09%) ⬆️
superset/viz.py 77.47% <100%> (+0.04%) ⬆️
superset/db_engine_specs.py 54.19% <31.25%> (-1.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d95c4c...c13415d. Read the comment docs.

superset/viz.py Outdated
@@ -467,6 +468,36 @@ def get_data(self, df):
def json_data(self):
return json.dumps(self.data)

def fix_df_column_case(self, df):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really belong in viz.py, I'd much rather have this in db_engine_spec.py and only execute for these weird database. Maybe it's a method in BaseEngineSpec that applies conditionally based no a class attribute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I'll move it over.

@villebro villebro changed the title Match viz dataframe column case to form_data fields Match viz dataframe column case to form_data fields for Snowflake, Oracle and Redshift Jul 26, 2018
@villebro
Copy link
Member Author

villebro commented Jul 26, 2018

This should now be ready for a new round of review and testing. I've updated the description to reflect the current state of the PR.

@mistercrunch Can you take a new look at this?
@minh5 Do you have the opportunity to test this on Redshift?
@mmuru Could you test this against your Snowflake datasources?

Again, all comments more than welcome.

@mistercrunch
Copy link
Member

LGTM would merge. Holding a bit for confirmation from people tagged here.

@mmuru
Copy link
Contributor

mmuru commented Jul 26, 2018

@villebro: I verified your PR using your branch and it worked against Snowflake data source.

@villebro
Copy link
Member Author

Thanks @mmuru for confirming!

@villebro
Copy link
Member Author

@mistercrunch This has now been confirmed on both Snowflake and Redshift. I ended up making a small refactor after your review (commit get metrics names from utils) and rebasing, but apart from that the main functionality is unchanged. Feeling confident that this is now mergeworthy. Has been confirmed to fix #5489. Thanks @mmuru and @keeyong for testing!

@minh5
Copy link
Contributor

minh5 commented Jul 27, 2018

It appears to work for me. The UI gets a little weird when I'm not using Superset not installed from pip (this includes source). But I have a series of queries using Redshift and the server doesn't return the SUM(column_name) errors that I usually experience.

@villebro
Copy link
Member Author

Thanks for testing @minh5 . Did you do the whole npm run dev etc routine as described in https://github.com/apache/incubator-superset/blob/master/CONTRIBUTING.md#setting-up-the-node--npm-javascript-environment ?

@mistercrunch
Copy link
Member

FYI there are some unrelated bugs on master at the moment.

@villebro
Copy link
Member Author

This also fixes #5353 , aka the handle_nulls() bug.

@mistercrunch
Copy link
Member

Would merge, please resolve conflict

@villebro
Copy link
Member Author

@mistercrunch FYI rebased and tested locally to work.

@villebro
Copy link
Member Author

villebro commented Aug 2, 2018

@mistercrunch Rebased and ready to merge again. py36-sqlite CI failure seems to be a technicality:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

@villebro
Copy link
Member Author

villebro commented Aug 3, 2018

Rebased to kickstart CI, now clean bill of health. Ready for merging @mistercrunch

@mistercrunch mistercrunch merged commit e1f4db8 into apache:master Aug 3, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
…acle and Redshift (apache#5487)

* Add function to fix dataframe column case

* Fix broken handle_nulls method

* Add case sensitivity option to dedup

* Refactor function definition and call location

* Remove added blank line

* Move df column rename logit to db_engine_spec

* Remove redundant variable

* Update comments in db_engine_specs

* Tie df adjustment to db_engine_spec class attribute

* Fix dedup error

* Linting

* Check for db_engine_spec attribute prior to adjustment

* Rename case sensitivity flag

* Linting

* Remove function that was moved to db_engine_specs

* Get metrics names from utils

* Remove double import and rename dedup variable
@machinoAI
Copy link

I am reading the data from redshift and trying to create chart in superset by grouping month wise but could not able to do since there is no option of grouping. What should I do ?

@villebro villebro deleted the preprocess_df branch December 10, 2018 13:15
@villebro
Copy link
Member Author

@adderRavi can you elaborate on what you are trying to do? Sounds like you want to use a month time grain. Also, if you are having trouble with RedShift I would appreciate if you could try #5827 as it fixes all problems I have been able to identify with RedShift and a few other SQL Alchemy engines.

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants